Abstract. We study properties of algorithms which minimize (or almost-minimize) empirical error over
a Donsker class of functions. We show that the $L_2$-diameter of the set of almost-minimizers converges
to zero in probability. Therefore, as the number of samples grows, it becomes unlikely that adding
a point (or a number of points) to the training set will result in a large jump (in $L_2$ distance) to a new
hypothesis. We also show that under some conditions the expected errors of the almost-minimizers
become close at a rate faster than $n^{-1/2}$.

1. Introduction


Let $(\mathcal{Z}, \mathcal{A})$ be a measurable space. Let $P$ be an (unknown) measure on $(\mathcal{Z}, \mathcal{A})$ and let $Z_1, \ldots, Z_n$ be independent
copies of $Z$ with distribution $P$. Let $\mathcal{F}$ be a class of functions from $\mathcal{Z}$ to $\mathbb{R}$. In the setting of Learning
Theory, samples $Z$ are input-output pairs $(X, Y)$ and, for $f \in \mathcal{F}$, $f(Z)$ measures how well the relationship
between $X$ and $Y$ is captured by $f$. The goal is to minimize $Pf = \mathbb{E} f(Z)$ where information about the
unknown $P$ is given only through the finite sample $S = (Z_1, \ldots, Z_n)$. Define the empirical measure as

$$P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{Z_i}.$$

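Definition 1. Given a sample $S$ and class $\mathcal{F}$,

$$f_S := \operatorname*{argmin}_{f \in \mathcal{F}} P_n f = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} f(Z_i)$$

is a minimizer of the empirical risk (empirical error), if the minimum exists.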


The Empirical Risk Minimization (ERM) algorithm above has been studied extensively in Learning Theory.
In this paper we prove some properties of almost-ERM algorithms which, to our knowledge, do not
appear in the literature. ERM is a reasonable strategy only if the class $\mathcal{F}$ is uniform Glivenko-Cantelli, that
is, $\mathcal{F}$ satisfies the uniform law of large numbers. In this paper we focus our attention on more restricted
classes: Donsker classes. These are classes satisfying not only the law of large numbers, but also a version
of the central limit theorem. The specific structure of the limit of this convergence will allow us to control
the correlation of the empirical means of the minimizers of empirical error.

Since  an  exact  minimizer  of  the  empirical  risk  might  not  exist,  as  well  as  for  algorithmic  reasons,  we 
consider  the  set  of  almost-minimizers  of  empirical  risk: 


Definition 2. Given $\xi \ge 0$ and $S$, define the set of almost empirical minimizers

$$\mathcal{M}_S^{\xi} = \{ f \in \mathcal{F} : P_n f - \inf_{g \in \mathcal{F}} P_n g \le \xi \}$$

and define its diameter as

$$\operatorname{diam} \mathcal{M}_S^{\xi} = \sup_{f, g \in \mathcal{M}_S^{\xi}} \| f - g \|.$$

The $\|\cdot\|$ in the above definition is the seminorm on $\mathcal{F}$ induced by the symmetric bilinear product

$$\langle f, f' \rangle = P (f - Pf)(f' - Pf').$$

This seminorm is a natural measure of distance between functions, as will become apparent later, because the
bilinear product above is the covariance of the limiting Gaussian process. Due to the close relation of the $\|\cdot\|$
seminorm to the $L_2(P)$ norm, the results of this paper will hold for the $L_2(P)$ norm as well.
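The two objects in Definition 2 can be computed directly for a finite class. The following is a minimal sketch, not part of the paper: it assumes a hypothetical class of threshold classifiers with 0-1 loss on noisy labels, and it estimates the seminorm $\|f - g\|$ as the standard deviation of $f - g$ on a large reference sample that stands in for the unknown $P$. All function names are illustrative.

```python
import numpy as np

def almost_minimizers(losses, xi):
    """Indices j with P_n f_j - min_k P_n f_k <= xi.

    losses has shape (num_functions, n) with losses[j, i] = f_j(Z_i).
    """
    emp_risk = losses.mean(axis=1)            # P_n f_j for every j
    return np.flatnonzero(emp_risk - emp_risk.min() <= xi)

def seminorm_diameter(ref_losses, idx):
    """Largest ||f - g|| over the selected functions, where ||f - g|| is
    estimated by the standard deviation of f - g on a reference sample."""
    sub = ref_losses[idx]
    diam = 0.0
    for a in range(len(idx)):
        diffs = sub[a] - sub[a:]              # f_a - f_b for all b >= a
        diam = max(diam, float(diffs.std(axis=1).max()))
    return diam

def threshold_losses(x, y, thresholds):
    """0-1 loss of the classifiers x -> 1{x > t}, one row per threshold t."""
    pred = x[None, :] > thresholds[:, None]
    return (pred != y[None, :]).astype(float)

def sample(rng, n, noise=0.2):
    """Hypothetical data model: X uniform on [0, 1], label 1{X > 0.5} with flips."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = (x > 0.5) ^ (rng.uniform(size=n) < noise)
    return x, y

rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 101)
x_ref, y_ref = sample(rng, 20_000)            # assumption: proxy for P
ref = threshold_losses(x_ref, y_ref, thresholds)

for n in (100, 1_000, 10_000):
    x, y = sample(rng, n)
    idx = almost_minimizers(threshold_losses(x, y, thresholds), xi=n ** -0.75)
    print(n, len(idx), round(seminorm_diameter(ref, idx), 3))
```

The choice $\xi(n) = n^{-3/4}$ is one example of a sequence that is $o(n^{-1/2})$; the printed diameters are Monte Carlo estimates and will vary with the seed.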

Definition 3. The empirical process $\nu_n$ indexed by $\mathcal{F}$ is defined as the map

$$f \mapsto \nu_n(f) = \sqrt{n}\,(P_n - P) f = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (f(Z_i) - Pf).$$
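As an illustration of Definition 3, the sketch below evaluates the empirical process on a small class of indicator functions $f_t(z) = 1\{z \le t\}$. It assumes, purely for the example, that $P$ is the uniform distribution on $[0, 1]$, so that $P f_t = t$ is known in closed form; the function name is hypothetical.

```python
import numpy as np

def empirical_process(z, thresholds):
    """nu_n(f_t) = sqrt(n) * (P_n f_t - P f_t) for f_t(z) = 1{z <= t},
    assuming P is uniform on [0, 1] so that P f_t = t."""
    n = z.shape[0]
    p_n = (z[None, :] <= thresholds[:, None]).mean(axis=1)   # P_n f_t
    return np.sqrt(n) * (p_n - thresholds)

rng = np.random.default_rng(0)
thresholds = np.array([0.25, 0.5, 0.75])
draws = np.array([empirical_process(rng.uniform(size=500), thresholds)
                  for _ in range(2000)])
# Sample covariance of (nu_n(f_s), nu_n(f_t)); compare with min(s, t) - s * t.
print(np.cov(draws.T))
```

For this class the covariance of $\nu_n(f_s)$ and $\nu_n(f_t)$ equals $\min(s,t) - st$ for every $n$, so the printed matrix should be close to that value up to Monte Carlo error.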

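Definition 4. A class $\mathcal{F}$ is called $P$-Donsker if

$$\nu_n \rightsquigarrow \nu$$

in $\ell^\infty(\mathcal{F})$, where the limit $\nu$ is a tight Borel measurable element in $\ell^\infty(\mathcal{F})$ and "$\rightsquigarrow$" denotes weak
convergence, as defined on p. 17 of [10].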


In fact, it follows that the limit process $\nu$ must be a zero-mean Gaussian process with covariance function $\mathbb{E}\,\nu(f)\nu(f') = \langle f, f' \rangle$.
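
As a worked example of this covariance structure (an illustration under assumed choices, not taken from the paper), let $P$ be uniform on $[0,1]$ and $f_t(z) = 1\{z \le t\}$. With the bilinear product $\langle f, f' \rangle = P(f - Pf)(f' - Pf')$ one obtains the Brownian bridge covariance:

```latex
\begin{align*}
P f_t &= t, \\
\langle f_s, f_t \rangle &= P(f_s f_t) - (P f_s)(P f_t) = \min(s,t) - st, \\
\| f_s - f_t \|^2 &= |t - s| \, (1 - |t - s|), \\
\| f_s - f_t \|_{L_2(P)}^2 &= |t - s| = \| f_s - f_t \|^2 + \big(P(f_s - f_t)\big)^2 .
\end{align*}
```

The last line shows, in this example, how the seminorm and the $L_2(P)$ norm differ only by the squared difference of the means.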

Various Donsker theorems provide sufficient conditions for checking if a class is $P$-Donsker. Here we
mention a few known results (see e.g. [10]) in terms of entropy $\log \mathcal{N}$ and entropy with bracketing $\log \mathcal{N}_{[\,]}$.

Proposition 1. If $\int_0^\infty \sqrt{\log \mathcal{N}_{[\,]}(\epsilon, \mathcal{F}, L_2(P))}\, d\epsilon < \infty$, then $\mathcal{F}$ is $P$-Donsker.

Definition 5. An envelope $F$ of the function class $\mathcal{F}$ is a measurable function with $F \ge |f|$ for all $f \in \mathcal{F}$.

Proposition 2. If the envelope $F$ is square integrable and $\int_0^\infty \sup_Q \sqrt{\log \mathcal{N}(\epsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q))}\, d\epsilon < \infty$,
then $\mathcal{F}$ is $P$-Donsker for every $P$, i.e. $\mathcal{F}$ is a universal Donsker class. Here the supremum is taken over all
finitely discrete probability measures.

If $\mathcal{F}$ is a $\{0, 1\}$-valued class, then $\mathcal{F}$ is a uniform Donsker class if and only if its VC dimension is finite (see
[3]). Rudelson and Vershynin [7] extend Dudley's result: a class $\mathcal{F}$ is uniform Donsker if the square root
of its VC dimension is integrable.


2.  Main  Result 


We  now  state  the  main  result  of  this  paper: 

Theorem 1. Let $\mathcal{F}$ be a $P$-Donsker class. For any sequence $\xi(n) = o(n^{-1/2})$,

$$\operatorname{diam} \mathcal{M}_S^{\xi(n)} \xrightarrow{P^*} 0.$$

The outer probability $P^*$ above is due to measurability issues. Definitions and results on various types
of convergence, as well as ways to deal with measurability issues arising in the proofs, are based on the
rigorous book of van der Vaart and Wellner [10].

Corollary 1. The result of Theorem 1 holds if the diameter is defined with respect to the $L_2(P)$ norm.


We  start  the  proofs  with  two  technical  Lemmata. 


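Lemma 1. Let $f_0, f_1 \in \mathcal{F}$, $\|f_0 - f_1\| \ge C/2$, $\|f_1\| \le \|f_0\|$. Let $h : \mathcal{F} \to \mathbb{R}$ be defined as $h(f') = \langle f', f_0 \rangle / \|f_0\|^2$.
Then for any $\epsilon \le C^3/128$

$$\inf_{B(f_0,\epsilon)} h - \sup_{B(f_1,\epsilon)} h \ge \frac{C^2}{16}.$$

Proof.

$$\Delta := \inf_{B(f_0,\epsilon)} h - \sup_{B(f_1,\epsilon)} h$$
$$= h(f_0) - h(f_1) + \inf \{ h(f' - f_0) + h(f_1 - f'') \mid f' \in B(f_0, \epsilon),\ f'' \in B(f_1, \epsilon) \}$$
$$\ge h(f_0) - h(f_1) - \frac{2\epsilon}{\|f_0\|} \ge h(f_0) - h(f_1) - \frac{8\epsilon}{C},$$

since $\|f_0\| \ge C/4$. Finally,

$$2 \langle f_0 - f_1, f_0 \rangle = \|f_0 - f_1\|^2 - \|f_1\|^2 + \|f_0\|^2 \ge \|f_0 - f_1\|^2 \ge \frac{C^2}{4},$$

then

$$h(f_0) - h(f_1) \ge \frac{C^2}{8 \|f_0\|^2} \ge \frac{C^2}{8}$$

(the last step uses $\|f_0\| \le 1$), which proves that

$$\Delta \ge \frac{C^2}{8} - \frac{8\epsilon}{C} \ge \frac{C^2}{16}.$$

□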
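
The inequality of Lemma 1 can be sanity-checked numerically. The sketch below is an illustration under assumptions that are not in the paper: it replaces $\mathcal{F}$ with Euclidean space $\mathbb{R}^d$ and its standard inner product, takes $\|f_0\| \le 1$, and uses full balls, for which the infimum and supremum of the linear functional $h$ are available in closed form.

```python
import numpy as np

def lemma1_gap(f0, f1, eps):
    """inf of h over B(f0, eps) minus sup of h over B(f1, eps), where
    h(f) = <f, f0> / ||f0||^2 is linear on Euclidean space."""
    n0 = np.linalg.norm(f0)
    inf_h = 1.0 - eps / n0                    # h(f0) = 1, worst shift is -eps/||f0||
    sup_h = float(f1 @ f0) / n0**2 + eps / n0
    return inf_h - sup_h

def check_lemma1(trials=1000, dim=5, seed=0):
    """Return True if the gap is at least C^2/16 in every random trial."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        f0 = rng.normal(size=dim)
        f0 *= rng.uniform(0.1, 1.0) / np.linalg.norm(f0)        # ||f0|| <= 1
        f1 = rng.normal(size=dim)
        f1 *= rng.uniform(0.0, 1.0) * np.linalg.norm(f0) / np.linalg.norm(f1)
        C = 2.0 * np.linalg.norm(f0 - f1)     # so that ||f0 - f1|| = C/2
        eps = C**3 / 128.0
        if lemma1_gap(f0, f1, eps) < C**2 / 16.0 - 1e-12:
            return False
    return True

print(check_lemma1())
```

Taking $C = 2\|f_0 - f_1\|$ makes the hypothesis $\|f_0 - f_1\| \ge C/2$ hold with equality, which is the least favorable case for a given pair.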


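The following Lemma is an adaptation of Lemma 2.3 of [4].

Lemma 2. Let $f_0, f_1, h$ be defined as in Lemma 1. Suppose $\epsilon \le C^3/128$. Let $\nu_\mu$ be a Gaussian process on $\mathcal{F}$
with mean $\mu$ and covariance $\operatorname{cov}(\nu_\mu(f), \nu_\mu(f')) = \langle f, f' \rangle$.

Then for all $\delta > 0$

$$\mathrm{Pr}^* \Big( \Big| \sup_{B(f_0,\epsilon)} \nu_\mu - \sup_{B(f_1,\epsilon)} \nu_\mu \Big| \le \delta \Big) \le \frac{64\delta}{C^3}.$$

Proof. Define the Gaussian process $Y(\cdot) = \nu_\mu(\cdot) - h(\cdot)\nu_\mu(f_0)$. Since $\operatorname{cov}(Y(f'), \nu_\mu(f_0)) = \langle f', f_0 \rangle - h(f') \|f_0\|^2 = 0$,
$\nu_\mu(f_0)$ and $Y(\cdot)$ are independent.

We now reason conditionally with respect to $Y(\cdot)$. Define

$$\Gamma_i(z) = \sup_{B(f_i,\epsilon)} \{ Y(\cdot) + h(\cdot) z \}, \quad i = 0, 1.$$

Notice that

$$\mathrm{Pr}^* \Big( \Big| \sup_{B(f_0,\epsilon)} \nu_\mu - \sup_{B(f_1,\epsilon)} \nu_\mu \Big| \le \delta \,\Big|\, Y \Big) = \mathrm{Pr}^* \big( |\Gamma_0(\nu_\mu(f_0)) - \Gamma_1(\nu_\mu(f_0))| \le \delta \big).$$

Moreover, $\Gamma_0$ and $\Gamma_1$ are convex and

$$\inf \partial_- \Gamma_0 - \sup \partial_+ \Gamma_1 \ge \inf_{B(f_0,\epsilon)} h - \sup_{B(f_1,\epsilon)} h \ge \frac{C^2}{16}$$

by Lemma 1. Then $\Gamma_0 = \Gamma_1$ in a single point $z_0$ and

$$\mathrm{Pr}^* \big( |\Gamma_0(\nu_\mu(f_0)) - \Gamma_1(\nu_\mu(f_0))| \le \delta \big) \le \mathrm{Pr}^* \big( \nu_\mu(f_0) \in [z_0 - \Delta, z_0 + \Delta] \big),$$

with $\Delta = 16\delta / C^2$.

Furthermore,

$$\mathrm{Pr}^* \big( \nu_\mu(f_0) \in [z_0 - \Delta, z_0 + \Delta] \big) \le \frac{32\delta}{C^2 \sqrt{2\pi \operatorname{Var}(\nu_\mu(f_0))}}$$

and $\operatorname{Var}(\nu_\mu(f_0)) = \|f_0\|^2 \ge C^2/16$, which completes the proof.

□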
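
The constants in the last step of this proof can be traced explicitly. The following worked arithmetic only uses facts stated in the proof: a Gaussian density is bounded by $1/\sqrt{2\pi \operatorname{Var}}$, the interval has length $2\Delta$ with $\Delta = 16\delta/C^2$, and $\operatorname{Var}(\nu_\mu(f_0)) \ge C^2/16$.

```latex
\begin{align*}
\mathrm{Pr}^*\big(\nu_\mu(f_0) \in [z_0 - \Delta, z_0 + \Delta]\big)
  &\le \frac{2\Delta}{\sqrt{2\pi \operatorname{Var}(\nu_\mu(f_0))}}
   = \frac{32\delta}{C^2 \sqrt{2\pi \operatorname{Var}(\nu_\mu(f_0))}} \\
  &\le \frac{32\delta}{C^2 \sqrt{2\pi}\,(C/4)}
   = \frac{128\,\delta}{\sqrt{2\pi}\, C^3}
   \approx \frac{51.1\,\delta}{C^3}.
\end{align*}
```

Applied with threshold $2\delta$, this gives roughly $102.1\,\delta/C^3$, which is consistent with the bound $128\delta/C^3$ used in the proof of Theorem 1.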


The proof of our main theorem relies on the Almost Sure Representation Theorem (Thm 1.10.4 in [10]).
Here we state the theorem applied to $\nu_n$ and $\nu$.

Proposition 3. Suppose $\mathcal{F}$ is $P$-Donsker. Let $\nu_n : \mathcal{Z}^n \to \ell^\infty(\mathcal{F})$ be the empirical process. There exist a
probability space $(\mathcal{Z}', \mathcal{A}', P')$ and maps $\nu'_n : \mathcal{Z}' \to \ell^\infty(\mathcal{F})$ such that

(1) $\nu'_n \xrightarrow{au} \nu'$ (almost uniformly);

(2) $\mathbb{E}^* f(\nu'_n) = \mathbb{E}^* f(\nu_n)$ for every bounded $f : \ell^\infty(\mathcal{F}) \to \mathbb{R}$, for all $n$.

Lemma 1.9.3 in [10] in turn shows that when the limiting process is Borel measurable, almost uniform
convergence implies convergence in outer probability. Therefore, the first implication of the proposition above
states that for any $C > 0$

$$\mathrm{Pr}^* \Big( \sup_{\mathcal{F}} |\nu'_n - \nu'| > C \Big) \to 0.$$

We are now ready to prove Theorem 1. The reasoning in the proof goes as follows. We consider a finite
cover of $\mathcal{F}$. Pick any two almost-minimizers which are "far apart". They belong to two covering balls with
centers "far apart". Because the two almost-minimizers belong to these balls, the infima of the empirical
risks over these two balls are close. This is translated into an event that the suprema of the shifted empirical
process over these two balls are close. By looking at the Gaussian limit process, we are able to exploit the
covariance structure to show that the suprema of the Gaussian process over balls with centers "far apart"
are unlikely to be close.

Proof of Theorem 1. Fix $C > 0$ and let $\epsilon = \min(C^3/128, C/4)$. Consider the $\epsilon$-covering $\{f_i \mid i = 1, \ldots, \mathcal{N}(\epsilon, \mathcal{F}, \|\cdot\|)\}$.
Such a covering exists because $\mathcal{F}$ is totally bounded in the $\|\cdot\|$ norm (see page 89 in [10]). For any $f, f' \in \mathcal{M}_S^{\xi(n)}$
such that $\|f - f'\| > C$, there exist $k$ and $l$ such that $\|f - f_k\| \le \epsilon \le C/4$ and $\|f' - f_l\| \le \epsilon \le C/4$. By the triangle
inequality it follows that $\|f_k - f_l\| \ge C/2$.


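Moreover,

$$\inf_{\mathcal{F}} P_n \le \inf_{B(f_k,\epsilon)} P_n \le P_n f \le \inf_{\mathcal{F}} P_n + \xi(n)$$

and

$$\inf_{\mathcal{F}} P_n \le \inf_{B(f_l,\epsilon)} P_n \le P_n f' \le \inf_{\mathcal{F}} P_n + \xi(n).$$

Therefore,

$$\Big| \inf_{B(f_k,\epsilon)} P_n - \inf_{B(f_l,\epsilon)} P_n \Big| \le \xi(n).$$

Since $P_n = P + n^{-1/2} \nu_n$, the last relation can be restated in terms of the empirical process $\nu_n$:

$$\Big| \sup_{B(f_k,\epsilon)} \{ -\nu_n - \sqrt{n} P \} - \sup_{B(f_l,\epsilon)} \{ -\nu_n - \sqrt{n} P \} \Big| \le \xi(n) \sqrt{n}.$$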
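Now choose an arbitrary $\delta > 0$ and fix $n_\delta$ such that for $n$ greater than $n_\delta$ the right-hand side of the above
relation is less than $\delta$. Then for all $n > n_\delta$

$$\mathrm{Pr}^* \big( \operatorname{diam} \mathcal{M}_S^{\xi(n)} > C \big) = \mathrm{Pr}^* \big( \exists f, f' \in \mathcal{M}_S^{\xi(n)},\ \|f - f'\| > C \big)$$
$$\le \mathrm{Pr}^* \Big( \exists l, k \text{ s.t. } \|f_k - f_l\| \ge C/2,\ \Big| \sup_{B(f_k,\epsilon)} \{ -\nu_n - \sqrt{n} P \} - \sup_{B(f_l,\epsilon)} \{ -\nu_n - \sqrt{n} P \} \Big| \le \delta \Big).$$

By the union bound,

$$\mathrm{Pr}^* \big( \operatorname{diam} \mathcal{M}_S^{\xi(n)} > C \big) \le \sum_{\substack{k,l = 1 \\ \|f_k - f_l\| \ge C/2}}^{\mathcal{N}(\epsilon, \mathcal{F}, \|\cdot\|)} \mathrm{Pr}^* \Big( \Big| \sup_{B(f_k,\epsilon)} \{ -\nu_n - \sqrt{n} P \} - \sup_{B(f_l,\epsilon)} \{ -\nu_n - \sqrt{n} P \} \Big| \le \delta \Big).$$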

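We now want to bound the terms in the sum above. By the Almost Sure Representation Theorem, there
exist a probability space $(\mathcal{Z}', \mathcal{A}', P')$ and maps $\nu'_n : \mathcal{Z}' \to \ell^\infty(\mathcal{F})$ such that $\mathrm{Pr}^* (\sup_{\mathcal{F}} |\nu'_n - \nu'| \ge \delta/2) \to 0$ and
$\nu_n$ and $\nu'_n$ have the same distribution. Assuming without loss of generality that $\|f_k\| \ge \|f_l\|$, we obtain

$$\mathrm{Pr}^* \Big( \Big| \sup_{B(f_k,\epsilon)} \{ -\nu_n - \sqrt{n} P \} - \sup_{B(f_l,\epsilon)} \{ -\nu_n - \sqrt{n} P \} \Big| \le \delta \Big)$$
$$= \mathrm{Pr}^* \Big( \Big| \sup_{B(f_k,\epsilon)} \{ -\nu'_n - \sqrt{n} P \} - \sup_{B(f_l,\epsilon)} \{ -\nu'_n - \sqrt{n} P \} \Big| \le \delta \Big)$$
$$= \mathrm{Pr}^* \Big( \Big| \sup_{B(f_k,\epsilon)} \{ -\nu' - \sqrt{n} P + \nu' - \nu'_n \} - \sup_{B(f_l,\epsilon)} \{ -\nu' - \sqrt{n} P + \nu' - \nu'_n \} \Big| \le \delta \Big)$$
$$\le \mathrm{Pr}^* \Big( \Big| \sup_{B(f_k,\epsilon)} \{ -\nu' - \sqrt{n} P \} - \sup_{B(f_l,\epsilon)} \{ -\nu' - \sqrt{n} P \} \Big| \le 2\delta \Big) + \mathrm{Pr}^* \Big( \sup_{\mathcal{F}} |\nu'_n - \nu'| \ge \delta/2 \Big)$$
$$\le \frac{128\delta}{C^3} + \mathrm{Pr}^* \Big( \sup_{\mathcal{F}} |\nu'_n - \nu'| \ge \delta/2 \Big),$$

where the first inequality results from a union bound argument while the second one results from Lemma
2 (applied with $2\delta$ in place of $\delta$), noticing that $-\nu' - \sqrt{n} P$ is a Gaussian process with covariance $\langle f, f' \rangle$ and
mean $-\sqrt{n} P$, and since by construction $\epsilon \le C^3/128$.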

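Finally we have

$$\mathrm{Pr}^* \big( \operatorname{diam} \mathcal{M}_S^{\xi(n)} > C \big) \le \mathcal{N}(\epsilon, \mathcal{F}, \|\cdot\|)^2 \Big( \frac{128\delta}{C^3} + \mathrm{Pr}^* \Big( \sup_{\mathcal{F}} |\nu'_n - \nu'| \ge \delta/2 \Big) \Big),$$

and the thesis follows from the arbitrariness of $\delta$. □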

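Proof of Corollary 1. Note that

$$\|f - f'\|_{L_2}^2 = \|f - f'\|^2 + (P(f - f'))^2.$$

The expected errors of almost-minimizers over a Glivenko-Cantelli (and therefore over a Donsker) class are
close because empirical means converge to the expectations.

$$\mathrm{Pr}^* \big( \exists f, f' \in \mathcal{M}_S^{\xi(n)} \text{ s.t. } \|f - f'\|_{L_2} > C \big)$$
$$\le \mathrm{Pr}^* \big( \exists f, f' \in \mathcal{M}_S^{\xi(n)} \text{ s.t. } |Pf - Pf'| > C/\sqrt{2} \big) + \mathrm{Pr}^* \big( \operatorname{diam} \mathcal{M}_S^{\xi(n)} > C/\sqrt{2} \big).$$

The first term can be bounded as

$$\mathrm{Pr}^* \big( \exists f, f' \in \mathcal{M}_S^{\xi(n)} \text{ s.t. } |Pf - Pf'| > C/\sqrt{2} \big)$$
$$\le \mathrm{Pr}^* \big( \exists f, f' \in \mathcal{F},\ |P_n f - P_n f'| \le \xi(n),\ |Pf - Pf'| > C/\sqrt{2} \big)$$
$$\le \mathrm{Pr}^* \Big( \sup_{f, f' \in \mathcal{F}} |\nu_n(f - f')| > \sqrt{n}\, \big| C/\sqrt{2} - \xi(n) \big| \Big),$$

which goes to 0 because the class $\{f - f' \mid f, f' \in \mathcal{F}\}$ is $P$-Donsker. The second term goes to 0 by Theorem
1. □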


3.  Stability  of  almost-ERM 

Corollary 2 shows stability of almost-ERM on Donsker classes. It implies that, in probability, the $L_2$ (and
thus $L_1$) distance between almost-minimizers on similar training sets (with $o(\sqrt{n})$ changes) is decreasing.

This result provides a partial answer to the questions raised in the Machine Learning literature by [6, 8]: is
it true that when one point is added to the training set, the ERM algorithm is less and less likely to jump
to a far (in the $L_1$ sense) hypothesis? In fact, since binary-valued function classes are uniform Donsker
if and only if the VC dimension is finite, Corollary 2 proves that almost-ERM over binary VC classes
possesses $L_1$ stability. For real-valued classes, the uniform Glivenko-Cantelli property is strictly more
general than the uniform Donsker property, and therefore it remains unclear whether almost-ERM over uGC but
not uniform Donsker classes is stable in the $L_1$ sense. This provides a partial answer to the question raised
in [8], where $L_1$ stability over uGC classes was conjectured.

Use of $L_1$ stability goes back to Devroye and Wagner [2], who showed that it is sufficient to bound the
difference between the leave-one-out error and the expected error of a learning algorithm. In particular,
Devroye and Wagner show that nearest-neighbor rules possess $L_1$ stability (see also [1]). Our Corollary 2
implies $L_1$ stability of ERM (or almost-ERM) algorithms on Donsker classes.

It  is  known  that  exact  empirical  risk  minimization  is  an  NP-hard  problem  even  for  simple  function  classes. 
An  interesting  further  direction  of  research  is  to  see  whether  the  result  of  Corollary  2  can  have  algorithmic 
consequences. 

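Corollary 2. Assume $\mathcal{F}$ is $P$-Donsker and uniformly bounded with envelope $F = 1$. For $I \subset \mathbb{N}$, define
$S(I) = (Z_i)_{i \in I}$. Let $I_n \subset \mathbb{N}$ be such that $M_n := |I_n \,\triangle\, [1:n]| = o(n^{1/2})$. Suppose $f_n \in \mathcal{M}_{S([1:n])}^{\xi(n)}$ and
$f'_n \in \mathcal{M}_{S(I_n)}^{\xi'(n)}$ for some $\xi(n) = o(n^{-1/2})$ and $\xi'(n) = o(n^{-1/2})$. Then

$$\|f_n - f'_n\| \xrightarrow{P^*} 0.$$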

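The norm $\|\cdot\|$ can be replaced by the $L_2(P)$ or $L_1(P)$ norm.

Proof. It is enough to show that $f'_n \in \mathcal{M}_{S([1:n])}^{\xi''(n)}$ for some $\xi''(n) = o(n^{-1/2})$; the result then follows from
Theorem 1.

$$\frac{1}{n} \sum_{i \in [1:n]} f'_n(Z_i) \le \frac{M_n}{n} + \frac{1}{n} \sum_{i \in I_n} f'_n(Z_i)$$
$$\le \frac{M_n}{n} + \frac{|I_n|}{n} \xi'(n) + \frac{1}{n} \inf_{g \in \mathcal{F}} \sum_{i \in I_n} g(Z_i)$$
$$\le \frac{M_n}{n} + \frac{|I_n|}{n} \xi'(n) + \frac{1}{n} \sum_{i \in I_n} f_n(Z_i)$$
$$\le 2\frac{M_n}{n} + \frac{|I_n|}{n} \xi'(n) + \frac{1}{n} \sum_{i \in [1:n]} f_n(Z_i)$$
$$\le 2\frac{M_n}{n} + \frac{|I_n|}{n} \xi'(n) + \xi(n) + \inf_{g \in \mathcal{F}} \frac{1}{n} \sum_{i \in [1:n]} g(Z_i).$$

Define $\xi''(n) := 2\frac{M_n}{n} + \frac{|I_n|}{n} \xi'(n) + \xi(n)$. Because $M_n = o(\sqrt{n})$, i.e. the two sets are not very different,
it follows that $\xi''(n) = o(n^{-1/2})$. Corollary 1 implies convergence in $L_2(P)$, and, therefore, in $L_1(P)$
norm. □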
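
To see how the combined tolerance $\xi''(n) = 2 M_n/n + (|I_n|/n)\,\xi'(n) + \xi(n)$ behaves, the sketch below evaluates $\sqrt{n}\,\xi''(n)$ for one illustrative choice of sequences. The choices $M_n = n^{1/4}$, $\xi(n) = \xi'(n) = n^{-3/4}$ and $|I_n| = n + M_n$ (the largest size compatible with $M_n$) are assumptions made only for this example, and the function name is hypothetical.

```python
def xi_double_prime(n, m_n, size_i_n, xi, xi_prime):
    """Combined tolerance 2*M_n/n + (|I_n|/n)*xi'(n) + xi(n) from the proof."""
    return 2.0 * m_n / n + (size_i_n / n) * xi_prime + xi

scaled = []
for n in (10**2, 10**4, 10**6):
    m_n = n ** 0.25                  # assumed: M_n = n^(1/4) = o(sqrt(n))
    xi = xi_prime = n ** -0.75       # assumed: o(n^(-1/2)) tolerances
    value = xi_double_prime(n, m_n, n + m_n, xi, xi_prime)
    scaled.append(value * n ** 0.5)  # sqrt(n) * xi''(n), should shrink with n
    print(n, scaled[-1])
```

With these choices every term of $\sqrt{n}\,\xi''(n)$ is of order $n^{-1/4}$, so the printed values decrease toward zero.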


4.  Expected  Error  Stability  of  almost-ERM 

We show that if a bound on the rate of decrease of the diameter in Theorem 1 is available, then, under
some conditions on the class, the difference between expected errors of almost-minimizers decays faster
than $n^{-1/2}$. Similarly to the previous section, this implies that ERM is stable in the sense that when the
training set is perturbed, the difference of expected errors decays faster than $n^{-1/2}$.

From  the  proof  of  Theorem  1,  the  rate  of  decrease  of  the  diameter  is  bounded  by  the  rate  of  convergence  of 
the  empirical  process  to  the  gaussian  process.  Some  results  on  the  rate  of  such  convergence  can  be  found  in 
[5].  In  the  following  Corollary,  we  will  assume  the  rate  of  decay  of  the  diameter  is  known  and  a  condition 
on  the  metric  entropy  growth  is  satisfied. 

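Corollary 3. Let $\mathcal{F}$ be a uniformly bounded function class with the envelope function $F = 1$. Assume
$\mathcal{N}(\mathcal{F}, \gamma) = \sup_Q \mathcal{N}(\mathcal{F}, Q, \gamma) < \infty$ for $0 < \gamma < 1$ and $Q$ ranging over all discrete probability measures. Let
$\mathcal{M}_S^{\xi(n)}$ be defined as above with $\xi(n) = o(n^{-1/2})$ and assume that for some $\lambda(n) = o(n^{1/2})$

(1) $\lambda(n) \operatorname{diam} \mathcal{M}_S^{\xi(n)} \xrightarrow{P^*} 0.$

Suppose further that

(2) $\lambda(n)^{1/2} - \log \mathcal{N}(\mathcal{F}, n^{-1/2} \lambda(n)^{-1/4}) \to +\infty.$

Then

$$\sqrt{n} \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} |P(f - f')| \xrightarrow{P^*} 0.$$

In particular, if $\mathcal{F}$ is a VC-subgraph class, condition (2) is satisfied whenever $\lambda(n)$ grows faster
than $\log^2 n$, i.e. $\lambda(n)/\log^2 n \to \infty$.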

The  proof  relies  on  the  following  ratio  inequality  of  Pollard  [9] : 

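Proposition 4. Let $\mathcal{G}$ be a uniformly bounded function class with the envelope function $G = 2$. Assume
$\mathcal{N}(\mathcal{G}, \gamma) = \sup_Q \mathcal{N}_1(\mathcal{G}, Q, 2\gamma) < \infty$ for $0 < \gamma < 1$ and $Q$ ranging over all discrete probability measures.

Then

$$\mathrm{Pr}^* \Big( \sup_{g \in \mathcal{G}} \frac{|P_n g - P g|}{\epsilon (P_n |g| + P |g|) + 5\gamma} > 26 \Big) \le 32\, \mathcal{N}(\mathcal{G}, \gamma) \exp(-n \epsilon \gamma).$$

Proof of Corollary 3. Define $\mathcal{G} = \{f - f' : f, f' \in \mathcal{F}\}$ and $\mathcal{G}' = \{|f - f'| : f, f' \in \mathcal{F}\}$. It can be shown that
$\mathcal{F}$, $\mathcal{G}$, and $\mathcal{G}'$ are Donsker classes (see [10]). In particular, $\mathcal{N}(\mathcal{G}, 2\gamma) \le \mathcal{N}(\mathcal{F}, \gamma)^2$ and the envelope of $\mathcal{G}$ is
$G = 2$. Apply Proposition 4 to the class $\mathcal{G}$:

$$\mathrm{Pr}^* \Big( \sup_{f, f' \in \mathcal{F}} \frac{|P_n(f - f') - P(f - f')|}{\epsilon (P_n |f - f'| + P |f - f'|) + 5\gamma} > 26 \Big) \le 32\, \mathcal{N}(\mathcal{F}, \gamma/2)^2 \exp(-n \epsilon \gamma).$$

The inequality therefore holds if the sup is taken over a smaller (random) subclass $\mathcal{M}_S^{\xi(n)}$, on which $|P_n(f - f')| \le \xi(n)$:

$$\mathrm{Pr}^* \Big( \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} \frac{|P(f - f')| - \xi(n)}{\epsilon (P_n |f - f'| + P |f - f'|) + 5\gamma} > 26 \Big) \le 32\, \mathcal{N}(\mathcal{F}, \gamma/2)^2 \exp(-n \epsilon \gamma).$$

Since $\sup_x \frac{A(x)}{B(x)} \ge \frac{\sup_x A(x)}{\sup_x B(x)}$,

$$\mathrm{Pr}^* \Big( \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} |P(f - f')| - \xi(n) > 26 \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} \big( \epsilon (P_n |f - f'| + P |f - f'|) + 5\gamma \big) \Big)$$
$$\le 32\, \mathcal{N}(\mathcal{F}, \gamma/2)^2 \exp(-n \epsilon \gamma).$$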


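By assumption,

$$\lambda(n) \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} P |f - f'| \xrightarrow{P^*} 0.$$

Because $\mathcal{G}'$ is Donsker and $\lambda(n) = o(n^{1/2})$,

$$\lambda(n) \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} \big| P_n |f - f'| - P |f - f'| \big| \xrightarrow{P^*} 0.$$

Thus,

$$\lambda(n) \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} \big( P_n |f - f'| + P |f - f'| \big) \xrightarrow{P^*} 0.$$

Now choose $\epsilon_n = n^{-1/2} \lambda(n)^{\alpha}$ and $\gamma_n = n^{-1/2} \lambda(n)^{-\beta}$ for any $0 < \beta < \alpha < 1$. Then $n \epsilon_n \gamma_n = \lambda(n)^{\alpha - \beta}$ and

$$n^{1/2} \lambda(n)^{1 - \alpha} \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} \epsilon_n \big( P_n |f - f'| + P |f - f'| \big) \xrightarrow{P^*} 0.$$

For the sake of simplicity, set $\alpha = 3/4$ and $\beta = 1/4$.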
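
With $\alpha = 3/4$ and $\beta = 1/4$ the quantities entering the ratio inequality can be written out explicitly. The following worked arithmetic uses only the definitions of $\epsilon_n$ and $\gamma_n$ above and the constants 26 and 5 from the ratio inequality:

```latex
\begin{align*}
\epsilon_n &= n^{-1/2} \lambda(n)^{3/4}, &
\gamma_n &= n^{-1/2} \lambda(n)^{-1/4}, \\
n \epsilon_n \gamma_n &= \lambda(n)^{1/2}, &
\sqrt{n}\, \epsilon_n &= \lambda(n)^{3/4} = \lambda(n)^{-1/4} \cdot \lambda(n), \\
\sqrt{n} \cdot 26 \cdot 5 \gamma_n &= 130\, \lambda(n)^{-1/4}. &&
\end{align*}
```

So the exponential factor in the ratio inequality becomes $\exp(-\lambda(n)^{1/2})$, the deterministic part of the bound is $130\,\lambda(n)^{-1/4}$, and the random part is $26\,\lambda(n)^{-1/4}$ times a quantity that tends to zero in outer probability. Only the rate $\lambda(n)^{-1/4} \to 0$ matters for the conclusion.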

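By definition of limit, for any $\delta > 0$, there exists $N_\delta$ such that for all $n > N_\delta$,

$$\mathrm{Pr}^* \Big( \sqrt{n} \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} 26 \big( \epsilon_n (P_n |f - f'| + P |f - f'|) + 5\gamma_n \big) > 131\, \lambda(n)^{-1/4} \Big) < \delta.$$

Thus,

$$\mathrm{Pr}^* \Big( \sqrt{n} \sup_{f, f' \in \mathcal{M}_S^{\xi(n)}} |P(f - f')| \le \sqrt{n}\, \xi(n) + 131\, \lambda(n)^{-1/4} \Big) \ge 1 - 32\, \mathcal{N}\big(\mathcal{F}, \tfrac{1}{2} n^{-1/2} \lambda(n)^{-1/4}\big)^2 \exp(-\lambda(n)^{1/2}) - \delta.$$

The result follows by the assumption on the entropy and by the arbitrariness of $\delta$.

If $\mathcal{F}$ is a VC-subgraph class of dimension $V$, its entropy numbers $\log \mathcal{N}(\mathcal{F}, \epsilon)$ behave like $V \log \frac{1}{\epsilon}$, i.e.
$\log \mathcal{N}(\mathcal{F}, n^{-1/2} \lambda(n)^{-1/4})$ behaves like $V \log n + V \log \lambda(n)$. Condition (2) will therefore hold whenever
$\lambda(n)$ grows faster than $\log^2 n$. □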